Genre Classification of Web Documents

نویسندگان

  • Elizabeth Sugar Boese
  • Adele E. Howe
چکیده

Retrieving relevant documents over the Web is an overwhelming task when search engines return thousands of Web documents. Sifting through these documents is time-consuming and sometimes leads to an unsuccessful search. One problem is that most search engines rely on matching a query to documents based solely on topical keywords. However, many users of search engines have a particular genre in mind for the desired documents. The genre of a document concerns aspects of the document such as the style or readability, presentation layout, and meta-content such as words in the title or the existence of graphs or photos. By including genre in Web searches, we hypothesize that Web document retrieval could greatly improve accuracy by better matching documents to the user’s information needs. Before implementing a search engine capable of discriminating on both genre and topic, a feasibility analysis of genre classification is needed. Our previous research achieved 91% classification accuracy across ten genres, whereas similar research range between 60 and 85% accuracy. However, the ten genres used in our research were mostly distinct and only exemplar Web documents (consisting of only one genre) were chosen. This paper discusses our current work which involves an in-depth analysis of maintaining high accuracy rates among genres that are very similar.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Thesis Stereotyping the Web: Genre Classification of Web Documents

OF THESIS STEREOTYPING THE WEB: GENRE CLASSIFICATION OF WEB DOCUMENTS Retrieving relevant documents over the Web is a difficult task. Currently, search engines rely on keywords for matching documents to user queries. This paper explores the potential for discriminating documents based on the genre of the document. I define genre as a taxonomy that incorporates the style, form and content of a d...

متن کامل

Genre Classification of Web Pages

Genre classification means to discriminate between documents by means of their form, their style, or their targeted audience. Put another way, genre classification is orthogonal to a classification based on the documents’ contents. While most of the existing investigations of an automated genre classification are based on news articles corpora, the idea here is applied to arbitrary Web pages. W...

متن کامل

Multiple sets of features for automatic genre classification of web documents

With the increase of information on the Web, it is difficult to find desired information quickly out of the documents retrieved by a search engine. One way to solve this problem is to classify web documents according to various criteria. Most document classification has been focused on a subject or a topic of a document. A genre or a style is another view of a document different from a subject ...

متن کامل

The Impact of Noise in Web Genre Identification

Genre detection of web documents fits an open-set classification task. The web documents not belonging to any predefined genre or where multiple genres co-exist is considered as noise. In this work we study the impact of noise on automated genre identification within an open-set classification framework. We examine alternative classification models and document representation schemes based on t...

متن کامل

Effectiveness of web search results for genre and sentiment classification

The motivation of this study is to enhance general topical search with a sentiment-based one where the search results (called snippets) returned by the Web search engine are clustered by sentiment categories. Firstly we developed an automatic method to identify product review documents using the snippets (summary information that includes the URL, title, and summary text), which is considered a...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2005